$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}
\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}
\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 | #2 \right)}
\newcommand{\muvec}{\boldsymbol \mu}
\newcommand{\sigmavec}{\boldsymbol \sigma}
\newcommand{\uttid}{s}
\newcommand{\lspeakervec}{\vec{w}}
\newcommand{\lframevec}{\vec{z}}
\newcommand{\lframevect}{\lframevec_t}
\newcommand{\inframevec}{\vec{x}}
\newcommand{\inframevect}{\inframevec_t}
\newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T}
\newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T}
\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}
\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}
\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}
\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}
\newcommand{\hidden}[1]{\vec{h}^{(#1)}}
\newcommand{\pool}{\max}
\newcommand{\hpooled}{\hidden{\pool}}
\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}
\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem from our IFT6266 class project using a hierarchical variational autoencoder.
While the basic VAE has only a single latent variable, this architecture assumes the image is generated from a hierarchy of latent variables, each conditioned on its parent. So the factorisation of the joint distribution looks like this:
$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)p(z_1|z_2) \dots p(z_{L-1}|z_L)p(z_L)$$
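To make the generative chain concrete, here is a minimal sketch of ancestral sampling through such a hierarchy. PyTorch, the layer sizes, and all names below are illustrative assumptions on my part, not the actual project code:

```python
# Sketch of sampling x by walking down the latent hierarchy z_L -> ... -> z_1 -> x.
# PyTorch and all dimensions are assumed here for illustration only.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, z_dims=(16, 32, 64), x_dim=784):
        super().__init__()
        self.z_top = z_dims[0]
        # one conditional p(z_{l-1} | z_l) per step down the hierarchy,
        # each predicting a mean and a log-variance
        self.cond = nn.ModuleList(
            nn.Linear(d_top, 2 * d_bot)
            for d_top, d_bot in zip(z_dims[:-1], z_dims[1:])
        )
        self.out = nn.Linear(z_dims[-1], x_dim)  # parameters of p(x | z_1)

    def sample(self, n):
        z = torch.randn(n, self.z_top)                 # z_L ~ N(0, I)
        for layer in self.cond:
            mu, logvar = layer(z).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # z_{l-1} ~ p(z_{l-1} | z_l)
        return torch.sigmoid(self.out(z))              # mean of p(x | z_1)
```

Sampling here is purely top-down: only the prior on $z_L$ and the learned conditionals are needed, which is exactly the structure the factorisation above describes.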
An architecture like this was used in the PixelVAE paper, but there they also use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the `recognition model` (i.e. the encoder) is not hierarchical: the $q_\phi$ network is structured in the following way:
$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
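In contrast to the generative chain, this recognition model predicts every level’s posterior directly from the image, with no dependence between the latents. A minimal sketch under the same illustrative assumptions as above:

```python
# Sketch of the factorized recognition model q(z_1|x) ... q(z_L|x):
# a shared trunk reads x, and one head per level outputs that level's
# mean and log-variance. Everything here is assumed for illustration.
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    def __init__(self, x_dim=784, z_dims=(64, 32, 16)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(256, 2 * d) for d in z_dims)

    def forward(self, x):
        h = self.trunk(x)
        # one (mu, logvar) pair per latent level, each conditioned only on x
        return [head(h).chunk(2, dim=-1) for head in self.heads]
```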